Are ambiguous conjunctions problematic for machine translation?
The translation of ambiguous words still poses challenges for machine translation.
In this work, we carry out a systematic quantitative analysis of the ability of different machine translation systems to disambiguate the source-language conjunctions "but" and "and". We evaluate specialised test sets focused on the translation of these two conjunctions. The test sets contain source languages that do not distinguish different variants of the given conjunction, whereas the target languages do. In total, we evaluate the conjunction "but" on 20 translation outputs and the conjunction "and" on 10. All machine translation systems almost perfectly recognise one variant of the target conjunction, especially for the source conjunction "but". The other target variant, however, represents a challenge for machine translation systems, with accuracy varying from 50% to 95% for "but" and from 20% to 57% for "and". The major error for all systems is replacing the correct target variant with the opposite one.
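The per-variant accuracy figures above can be reproduced with a simple scoring loop over a contrastive test set. The sketch below is illustrative only (it is not the authors' code); the German variants "aber"/"sondern" for English "but" and the item data are hypothetical examples.

```python
# Illustrative sketch (not the authors' evaluation code): scoring a
# contrastive conjunction test set. Each item records which target-language
# conjunction variant is correct (gold) and which one a system produced.
from collections import defaultdict

def variant_accuracy(items):
    """items: list of (gold_variant, predicted_variant) pairs.
    Returns accuracy per gold variant."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in items:
        total[gold] += 1
        if pred == gold:
            correct[gold] += 1
    return {v: correct[v] / total[v] for v in total}

# Hypothetical German outputs for English "but" (aber vs. sondern):
items = [("aber", "aber"), ("aber", "aber"),
         ("sondern", "aber"), ("sondern", "sondern")]
print(variant_accuracy(items))  # {'aber': 1.0, 'sondern': 0.5}
```

A per-variant breakdown like this makes the asymmetry reported above visible: aggregate accuracy would hide that one variant is translated almost perfectly while the other is not.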
On context span needed for machine translation evaluation
Despite increasing efforts to improve evaluation of machine translation (MT) by going beyond the sentence level to the document level, the definition of what exactly constitutes a "document level" is still not clear. This work deals with the context span necessary for a more reliable MT evaluation. We report results from a series of surveys involving three domains and 18 target languages, designed to identify the necessary context span as well as issues related to it. Our findings indicate that, although some issues and spans are strongly dependent on the domain and the target language, a number of common patterns can be observed, so that general guidelines for context-aware MT evaluation can be drawn.
On the same page? Comparing inter-annotator agreement in sentence and document level human machine translation evaluation
Document-level evaluation of machine translation has raised interest in the community, especially since responses to the claims of "human parity" (Toral et al., 2018; Läubli et al., 2018) with document-level human evaluations have been published. Yet, little is known about best practices regarding human evaluation of machine translation at the document level.
This paper presents a comparison of the differences in inter-annotator agreement between quality assessments using sentence-level and document-level set-ups. We report results of the agreement between professional translators for fluency and adequacy scales, error annotation, and pair-wise ranking, along with the effort needed to perform the different tasks. To the best of our knowledge, this is the first study of its kind.
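Inter-annotator agreement in studies like this is commonly quantified with a chance-corrected statistic such as Cohen's kappa (the abstract does not specify which statistic the authors used; this is a generic illustration). A minimal sketch:

```python
# Illustrative sketch: Cohen's kappa, a standard chance-corrected
# inter-annotator agreement statistic for two annotators.
from collections import Counter

def cohens_kappa(a, b):
    """a, b: equal-length lists of category labels from two annotators."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # expected chance agreement, from each annotator's label marginals
    p_e = sum(ca[c] * cb[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of five sentences by two annotators:
r1 = ["good", "good", "bad", "bad", "good"]
r2 = ["good", "bad", "bad", "bad", "good"]
print(round(cohens_kappa(r1, r2), 2))  # 0.62
```

Computing the same statistic separately for sentence-level and document-level set-ups is what allows the kind of comparison the paper reports.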
Evaluating the impact of light post-editing on usability
This paper discusses a methodology to measure the usability of machine-translated content by end users, comparing lightly post-edited content with raw output and with the usability of source-language content. The content selected consists of Online Help articles from a software company for a spreadsheet application, translated from English into German. Three groups of five users each used either the source text, i.e. the English version (EN), the raw MT version (DE_MT), or the lightly post-edited version (DE_PE), and were asked to carry out six tasks. Usability was measured using an eye tracker and cognitive, temporal, and pragmatic measures of usability. Satisfaction was measured via a post-task questionnaire presented after the participants had completed the tasks.
Acceptability of machine-translated content: a multi-language evaluation by translators and end-users
As machine translation (MT) continues to be used increasingly in the translation industry, there is a corresponding increase in the need to understand MT quality and, in particular, its impact on end-users. To date, little work has been carried out to investigate the acceptability of MT output among end-users and, ultimately, how acceptable they find it. This article reports on research conducted to address that gap. End-users of instructional content machine-translated from English into German, Simplified Chinese, and Japanese were engaged in a usability experiment. Part of this experiment involved giving feedback on the acceptability of raw machine-translated content and lightly post-edited (PE) versions of the same content. In addition, a quality review was carried out in collaboration with an industry partner and experienced translation quality reviewers. The translation quality assessment (TQA) results from translators reflect the usability and satisfaction results from end-users insofar as the implementation of light PE both increased the usability and acceptability of the PE instructions and led to satisfaction being reported. Nonetheless, the raw MT content also received good scores, especially for terminology, country standards, and spelling.
A human evaluation of English-Irish statistical and neural machine translation
With official status in both Ireland and the EU, there is a need for high-quality English-Irish (EN-GA) machine translation (MT) systems which are suitable for use in a professional translation environment. While we have seen recent research on improving both statistical MT and neural MT for the EN-GA pair, the results of such systems have always been reported using automatic evaluation metrics. This paper provides the first human evaluation study of EN-GA MT using professional translators and in-domain (public administration) data for a more accurate depiction of the translation quality available via MT.
Reading comprehension of machine translation output: what makes for a better read?
This paper reports on a pilot experiment
that compares two different machine translation (MT) paradigms in reading comprehension tests. To explore a suitable
methodology, we set up a pilot experiment with a group of six users (with English, Spanish and Simplified Chinese languages) using an English Language Testing System (IELTS), and an eye-tracker.
The users were asked to read three texts
in their native language: either the original
English text (for the English speakers) or
the machine-translated text (for the Spanish and Simplified Chinese speakers). The
original texts were machine-translated via
two MT systems: neural (NMT) and statistical (SMT). The users were also asked
to rank satisfaction statements on a 3-point
scale after reading each text and answering
the respective comprehension questions.
After all tasks were completed, a post-task
retrospective interview took place to gather
qualitative data. The findings suggest that
the users from the target languages completed more tasks in less time with a higher
level of satisfaction when using translations from the NMT system
Translation dictation vs. post-editing with cloud-based voice recognition: a pilot experiment
In this paper, we report on a pilot mixed-methods experiment investigating the effects on productivity and on the translator experience of integrating machine translation (MT) post-editing (PE) with voice recognition (VR) and translation dictation (TD). The experiment was performed with a sample of native Spanish-speaking participants. In the quantitative phase of the experiment, they performed four tasks under four different conditions, namely (1) conventional TD; (2) PE in dictation mode; (3) TD with VR; and (4) PE with VR (PEVR). In the follow-on qualitative phase, the participants filled out an online survey, providing details of their perceptions of the tasks and of PEVR in general. Our results suggest that PEVR may be a usable way to add MT to a translation workflow, with some caveats. When asked about their experience with the tasks, our participants preferred translation without the "constraint" of MT, though the quantitative results show that PE tasks were generally more efficient. This paper provides a brief overview of past work exploring VR for from-scratch translation and PE purposes, describes our pilot experiment in detail, presents an overview and analysis of the data collected, and outlines avenues for future work.
Document-level machine translation evaluation project: methodology, effort and inter-annotator agreement
Recently, document-level (doc-level) human evaluation of machine translation (MT) has raised interest in the community after a few attempts have disproved claims of "human parity" (Toral et al., 2018; Läubli et al., 2018). However, little is still known about best practices regarding doc-level human evaluation. This project aims to identify methodologies to better cope with i) the current state-of-the-art (SOTA) human metrics, ii) a possible complexity when assigning a single score to a text consisting of "good" and "bad" sentences, iii) a possible tiredness bias in doc-level set-ups, and iv) the difference in inter-annotator agreement (IAA) between sentence-level and doc-level set-ups.
How much context span is enough? Examining context-related issues for document-level MT
This paper analyses how much context span is necessary to solve different context-related issues, namely reference, ellipsis, gender, number, lexical ambiguity, and terminology, when translating from English into Portuguese. We use the DELA corpus, which consists of 60 documents and six different domains (subtitles, literary, news, reviews, medical, and legislation). We find that the shortest context span needed to disambiguate issues can appear in different positions in the document, including preceding, following, global, and world knowledge; and that the average length depends on the issue type as well as the domain. Additionally, we show that the standard approach of relying on only two preceding sentences as context might not be enough, depending on the domain and issue types.